Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation
TensorFlow has been the most widely adopted Machine/Deep Learning framework.
However, little exists in the literature that provides a thorough understanding
of the capabilities which TensorFlow offers for the distributed training of
large ML/DL models that need computation and communication at scale. Most
commonly used distributed training approaches for TF can be categorized as
follows: 1) Google Remote Procedure Call (gRPC), 2) gRPC+X: X=(InfiniBand
Verbs, Message Passing Interface, and GPUDirect RDMA), and 3) No-gRPC: Baidu
Allreduce with MPI, Horovod with MPI, and Horovod with NVIDIA NCCL. In this
paper, we provide an in-depth performance characterization and analysis of
these distributed training approaches on various GPU clusters including the Piz
Daint system (#6 on Top500). We perform experiments to gain novel insights along
the following vectors: 1) Application-level scalability of DNN training, 2)
Effect of Batch Size on scaling efficiency, 3) Impact of the MPI library used
for no-gRPC approaches, and 4) Type and size of DNN architectures. Based on
these experiments, we present two key insights: 1) Overall, No-gRPC designs
achieve better performance compared to gRPC-based approaches for most
configurations, and 2) The performance of No-gRPC is heavily influenced by the
gradient aggregation using Allreduce. Finally, we propose a truly CUDA-Aware
MPI Allreduce design that exploits CUDA kernels and pointer caching to perform
large reductions efficiently. Our proposed designs offer 5-17X better
performance than NCCL2 for small and medium messages, and reduce latency by
29% for large messages. The proposed optimizations help Horovod-MPI to achieve
approximately 90% scaling efficiency for ResNet-50 training on 64 GPUs.
Further, Horovod-MPI achieves 1.8X and 3.2X higher throughput than the native
gRPC method for ResNet-50 and MobileNet, respectively, on the Piz Daint
cluster.
Comment: 10 pages, 9 figures, submitted to IEEE IPDPS 2019 for peer review.
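As an illustration of the No-gRPC pattern the paper benchmarks, the sketch
below shows a minimal Horovod-over-MPI setup for ResNet-50 in TensorFlow. It
assumes Horovod built against a CUDA-aware MPI and a standard mpirun launch;
it is a sketch of the usage pattern, not the paper's benchmark code.

    # Minimal sketch: Horovod + (CUDA-aware) MPI gradient aggregation.
    # Launch with, e.g.:  mpirun -np 64 python train.py
    import tensorflow as tf
    import horovod.tensorflow.keras as hvd

    hvd.init()  # one MPI rank per GPU

    # Pin each rank to its local GPU so Allreduce can work on device buffers.
    gpus = tf.config.experimental.list_physical_devices('GPU')
    if gpus:
        tf.config.experimental.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

    model = tf.keras.applications.ResNet50(weights=None)
    # DistributedOptimizer wraps each gradient step in an MPI Allreduce,
    # which is the operation the paper's CUDA-aware design accelerates.
    opt = hvd.DistributedOptimizer(tf.keras.optimizers.SGD(0.1 * hvd.size()))
    model.compile(loss='sparse_categorical_crossentropy', optimizer=opt)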
A Novel Tensor-Expert Hybrid Parallelism Approach to Scale Mixture-of-Experts Training
A new neural network architecture called Mixture-of-Experts (MoE) has been
proposed recently that increases the parameters of a neural network (the base
model) by adding sparsely activated expert blocks, without changing the total
number of floating point operations for training or inference. In theory, this
architecture allows us to train arbitrarily large models while keeping the
computational cost the same as that of the base model. However, beyond 64 to
128 expert blocks, prior work has observed diminishing returns in the test
accuracies of these MoE models. Thus, training high quality MoE models requires
us to scale the size of the base models, along with the number of expert
blocks. In this work, we propose a novel, three-dimensional, hybrid parallel
algorithm that combines tensor, expert, and data parallelism to enable the
training of MoE models with 4-8x larger base models than the current
state-of-the-art -- DeepSpeed-MoE. We propose memory optimizations in the
optimizer step, and communication optimizations that eliminate redundant
movement of data. Removing these redundancies provides a speedup of nearly 21%.
When training a 40 billion parameter MoE model (6.7 billion base model with 16
experts) on 128 V100 GPUs, our optimizations raise the achieved fraction of
peak half-precision flop/s from 20% to 27%.
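A quick back-of-the-envelope check of the quoted model size: an MoE model
replicates only the expert weights, so total size follows from the base size,
the expert count, and the fraction of base parameters living in expert blocks.
The one-third fraction below is an assumption for illustration (e.g., MoE
layers on alternating FFNs), not a figure from the paper.

    # total = shared params + num_experts * replicated (expert) params
    def moe_params(base, num_experts, expert_fraction):
        shared = base * (1 - expert_fraction)
        replicated = num_experts * base * expert_fraction
        return shared + replicated

    # 6.7B base, 16 experts, ~1/3 of base params in expert blocks:
    print(moe_params(6.7e9, 16, 1/3) / 1e9)  # ~40.2, near the quoted 40B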
Intercloud Message Exchange Middleware
Cloud Interoperability has been a core issue pertaining to the Intercloud and
Cloud Federation. Several vendor-based proprietary solutions and open-source
middleware exist to address it; however, these solutions are tightly coupled
to particular cloud environments. For heterogeneous clouds to exist in an
interoperable environment, a vendor-independent, secure, and reliable message
exchange middleware is critical. In this paper, considering a general cloud
architecture, we present a Publish-Subscribe based middleware for Intercloud
Message Exchange, implemented on top of the Data Distribution Service (DDS).
DDS's reliable pub-sub messaging, in conjunction with our devised Information
Model, can be a novel candidate for the messaging domain of Intercloud
Interoperability Standards. This Information Model also hosts an OWL-based
Cloud Resource Description Ontology, used by cloud environments for resource
cataloguing and possible matchmaking prior to workload migration between
heterogeneous clouds.
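To make the pub-sub pattern concrete, here is a minimal in-process sketch; a
real DDS deployment adds typed topics, QoS policies, and peer discovery over
the network, and the topic name and message fields below are hypothetical.

    # Minimal in-process publish-subscribe broker (illustration only;
    # DDS provides this over the network with QoS and discovery).
    from collections import defaultdict

    class Broker:
        def __init__(self):
            self._subscribers = defaultdict(list)

        def subscribe(self, topic, callback):
            self._subscribers[topic].append(callback)

        def publish(self, topic, message):
            for callback in self._subscribers[topic]:
                callback(message)

    broker = Broker()
    # Cloud B listens for resource descriptions published for matchmaking.
    broker.subscribe("intercloud/resource_catalog",
                     lambda msg: print("cloud B sees:", msg))
    # Cloud A advertises a (hypothetical) resource description.
    broker.publish("intercloud/resource_catalog",
                   {"cloud": "A", "vcpus": 8, "memory_gb": 32})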
DeepSpeed-Chat: Easy, Fast and Affordable RLHF Training of ChatGPT-like Models at All Scales
ChatGPT-like models have revolutionized various applications in artificial
intelligence, from summarization and coding to translation, matching or even
surpassing human performance. However, the current landscape lacks an
accessible, efficient, and cost-effective end-to-end RLHF (Reinforcement
Learning with Human Feedback) training pipeline for these powerful models,
particularly when training at the scale of billions of parameters. This paper
introduces DeepSpeed-Chat, a novel system that democratizes RLHF training,
making it accessible to the AI community. DeepSpeed-Chat offers three key
capabilities: an easy-to-use training and inference experience for ChatGPT-like
models, a DeepSpeed-RLHF pipeline that replicates the training pipeline from
InstructGPT, and a robust DeepSpeed-RLHF system that combines various
optimizations for training and inference in a unified way. The system delivers
unparalleled efficiency and scalability, enabling training of models with
hundreds of billions of parameters in record time and at a fraction of the
cost. With this development, DeepSpeed-Chat paves the way for broader access to
advanced RLHF training, even for data scientists with limited resources,
thereby fostering innovation and further development in the field of AI.
Comment: 14 pages, 7 figures
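The InstructGPT-style pipeline that DeepSpeed-Chat replicates has three
stages; the outline below is a schematic of those stages, not the
DeepSpeed-Chat API, and the function bodies are deliberately left as stubs.

    # Schematic of the three RLHF stages (stubs, not DeepSpeed-Chat's API).
    def supervised_finetune(base_model, demonstrations):
        """Step 1: fine-tune the base LM on human demonstrations (SFT)."""
        ...

    def train_reward_model(base_model, preference_pairs):
        """Step 2: fit a reward model on human preference rankings."""
        ...

    def ppo_finetune(sft_model, reward_model, prompts):
        """Step 3: optimize the SFT model with PPO against the reward model."""
        ...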
A Hybrid Tensor-Expert-Data Parallelism Approach to Optimize Mixture-of-Experts Training
Mixture-of-Experts (MoE) is a neural network architecture that
adds sparsely activated expert blocks to a base model, increasing
the number of parameters without impacting computational costs.
However, current distributed deep learning frameworks are limited
in their ability to train high-quality MoE models with large base
models. In this work, we present DeepSpeed-TED, a novel, three-dimensional,
hybrid parallel algorithm that combines data, tensor,
and expert parallelism to enable the training of MoE models with
4–8× larger base models than the current state-of-the-art. We also
describe memory optimizations in the optimizer step, and communication optimizations that eliminate unnecessary data movement.
We implement our approach in DeepSpeed and achieve speedups of
26% over a baseline (i.e. without our communication optimizations)
when training a 40 billion parameter MoE model (6.7 billion base
model with 16 experts) on 128 V100 GPUs.
https://doi.org/10.1145/3577193.359370
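One way to picture the three-dimensional decomposition is as a grid of ranks
sliced three ways. The layout below is illustrative, not DeepSpeed-TED's
actual grouping; each returned list of ranks would typically become one
communicator group (e.g., via torch.distributed.new_group).

    # Illustrative rank layout for a data x expert x tensor grid.
    import numpy as np

    def rank_grid(world_size, tensor_par, expert_par):
        data_par = world_size // (tensor_par * expert_par)
        grid = np.arange(world_size).reshape(data_par, expert_par, tensor_par)
        tensor_groups = grid.reshape(-1, tensor_par).tolist()
        expert_groups = np.swapaxes(grid, 1, 2).reshape(-1, expert_par).tolist()
        data_groups = grid.reshape(data_par, -1).T.tolist()
        return tensor_groups, expert_groups, data_groups

    # 8 GPUs with tensor_par=2 and expert_par=2 leave data_par=2:
    t, e, d = rank_grid(8, 2, 2)
    print(t)  # [[0, 1], [2, 3], [4, 5], [6, 7]]
    print(e)  # [[0, 2], [1, 3], [4, 6], [5, 7]]
    print(d)  # [[0, 4], [1, 5], [2, 6], [3, 7]]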
DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale
As the training of giant dense models hits the boundary on the availability
and capability of the hardware resources today, Mixture-of-Experts (MoE) models
become one of the most promising model architectures due to their significant
training cost reduction compared to a quality-equivalent dense model. Its
training cost saving is demonstrated from encoder-decoder models (prior works)
to a 5x saving for auto-regressive language models (this work along with
parallel explorations). However, due to the much larger model size and unique
architecture, how to provide fast MoE model inference remains challenging and
unsolved, limiting its practical usage. To tackle this, we present
DeepSpeed-MoE, an end-to-end MoE training and inference solution as part of the
DeepSpeed library, including novel MoE architecture designs and model
compression techniques that reduce MoE model size by up to 3.7x, and a highly
optimized inference system that provides 7.3x better latency and cost compared
to existing MoE inference solutions. DeepSpeed-MoE offers an unprecedented
scale and efficiency to serve massive MoE models with up to 4.5x faster and 9x
cheaper inference compared to quality-equivalent dense models. We hope our
innovations and systems help open a promising path to new directions in the
large model landscape, a shift from dense to sparse MoE models, where training
and deploying higher-quality models with fewer resources becomes more widely
possible.
Comment: This paper is published at ICML 2022:
https://proceedings.mlr.press/v162/rajbhandari22
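The "sparsely activated" property central to these results comes from the MoE
routing step; the numpy sketch below shows top-1 gating, where each token
activates exactly one expert so compute stays flat as experts are added
(dimensions and weights are illustrative).

    # Top-1 expert routing: each token is processed by a single expert.
    import numpy as np

    rng = np.random.default_rng(0)
    tokens, d_model, n_experts = 4, 8, 16
    x = rng.standard_normal((tokens, d_model))
    W_gate = rng.standard_normal((d_model, n_experts))
    experts = [rng.standard_normal((d_model, d_model)) for _ in range(n_experts)]

    logits = x @ W_gate                 # router scores per expert
    choice = logits.argmax(axis=1)      # top-1 expert for each token
    y = np.stack([x[i] @ experts[e] for i, e in enumerate(choice)])
    print(choice, y.shape)              # one expert touched per token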
DeepSpeed Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale
The past several years have witnessed the success of transformer-based
models, and their scale and application scenarios continue to grow
aggressively. The current landscape of transformer models is increasingly
diverse: the model size varies drastically with the largest being of
hundred-billion parameters; the model characteristics differ due to the
sparsity introduced by the Mixture-of-Experts; the target application scenarios
can be latency-critical or throughput-oriented; the deployment hardware could
be single- or multi-GPU systems with different types of memory and storage,
etc. With such increasing diversity and the fast-evolving pace of transformer
models, designing a highly performant and efficient inference system is
extremely challenging. In this paper, we present DeepSpeed Inference, a
comprehensive system solution for transformer model inference to address the
above-mentioned challenges. DeepSpeed Inference consists of (1) a multi-GPU
inference solution to minimize latency while maximizing the throughput of both
dense and sparse transformer models when they fit in aggregate GPU memory, and
(2) a heterogeneous inference solution that leverages CPU and NVMe memory in
addition to the GPU memory and compute to enable high inference throughput with
large models which do not fit in aggregate GPU memory. DeepSpeed Inference
reduces latency by up to 7.3X over the state-of-the-art for latency-oriented
scenarios and increases throughput by over 1.5x for throughput-oriented
scenarios. Moreover, it enables trillion parameter scale inference under
real-time latency constraints by leveraging hundreds of GPUs, an unprecedented
scale for inference. It can inference 25x larger models than with GPU-only
solutions, while delivering a high throughput of 84 TFLOPS (over of
A6000 peak)
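For reference, the multi-GPU (tensor-parallel) path is exposed through
DeepSpeed's init_inference entry point; the sketch below follows the pattern
in DeepSpeed's inference tutorials, with the model choice and mp_size as
illustrative assumptions rather than settings from the paper.

    # Sketch of latency-oriented multi-GPU inference with DeepSpeed.
    # Launch with, e.g.:  deepspeed --num_gpus 4 infer.py
    import torch
    import deepspeed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained("gpt2")
    engine = deepspeed.init_inference(
        model,
        mp_size=4,                        # shard layers across 4 GPUs
        dtype=torch.half,                 # fp16 weights and compute
        replace_with_kernel_inject=True,  # use DeepSpeed's fused kernels
    )
    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    inputs = tokenizer("DeepSpeed Inference at scale",
                       return_tensors="pt").to("cuda")
    outputs = engine(**inputs)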